1. Introduction


Data mining, ever since it was formally established as an independent research area, has
advanced rapidly, with applications ranging from government to industry to people's daily
life [Han2006]. Community generation, a recently emerging research topic in data mining
[Zhang2003, Salerno2004], has become one of the most active research foci in the data mining
community [Domingos2001, Mannila2002, Smyth2001, Long2005a, Long2006a, Long2006b]. The
research is motivated by the fact that the technologies it produces have found substantial
applications in governmental and industrial sectors, such as fraud detection [Goldberg1997,
Jensen1997, Stolfo1997], crime investigation [Senator1995, Zhang2003, Salerno2004], sales
promotion [Brin97a, Brin97b], social network analysis [Getoor2002, Kersting2000,
Sarwar2001, Shardanand1995, Scott1991, Wassermann1994, Jensen2002, Glymour2001,
Glymour1999], and Web mining [Aggarwal2001, White1996, Mack2002, Sarwar2001,
Gibson1998], to name just a few. On the other hand, as of today there is no well-established
theory for this topic. Given this context, this project aims to develop a theory of
community generation and to identify applications of that theory.

Prior to the project, the PI initiated research on community generation in collaboration
with his AFRL mentor, Dr. John Salerno, who is also the program manager of this project. In
this preliminary research, we identified the new paradigm of Uni-party Data Community
Generation (UDCG), in contrast to the existing work in the literature on Bi-party Data
Community Generation (BDCG), which is also referred to as relational data community
generation. This preliminary work served as the foundation for this project and resulted in
several publications [Zhang2002, Zhang2003, Salerno2004].

This project was funded through the Information Institute for the term from June 2004 to
June 2006, with total funding of $80,000. Under this project, one of the PI's PhD students,
Bo Long, was supported. We have made excellent progress in this project [Long2005a,
Long2005b, Long2006a, Long2006b]. As a result of these achievements, we have received
substantial publicity in the data mining community and invitations for collaboration from
major industrial and governmental research and development organizations, including
Microsoft Research, IBM Research, Microsoft MSN, Google, Yahoo!, and DOE Berkeley/Lawrence
Lab. Some of the technical achievements are being considered by the Tech Transfer Office of
SUNY Binghamton for possible US patent applications. In the following sections, we briefly
summarize the major achievements of this project.

2. Block Value Decomposition for Dyadic Data Community Generation


Clustering is used for community generation in many disciplines and has a wide range of
applications. In many of them, such as document clustering, collaborative filtering, and
microarray analysis, the data can be formulated as a two-dimensional matrix representing a
set of dyadic data. Dyadic data refer to a domain with two finite sets of objects in which
observations are made for dyads, i.e., pairs with one element from each set. For such data,
co-clustering both dimensions of the data matrix simultaneously is often more desirable than
traditional one-way clustering, because co-clustering exploits the duality between rows and
columns to deal effectively with the high-dimensional, sparse data typical of these
applications. Co-clustering has the additional benefit of providing row clusters and column
clusters at the same time. For example, we may be interested in simultaneously clustering
genes and experimental conditions in bioinformatics, documents and words in text mining, or
users and movies in collaborative filtering.

In this work [Long2005a], we have developed a new co-clustering framework called Block
Value Decomposition (BVD). The key idea is that the latent block structure in a
two-dimensional dyadic data matrix Z can be explored through a triple decomposition. The
dyadic data matrix is factorized into three components: the row-coefficient matrix R, the
block value matrix B, and the column-coefficient matrix C, as shown in Eq. 1. The
coefficients denote the degrees to which the rows and columns are associated with their
clusters, and the block value matrix is an explicit and compact representation of the hidden
block structure of the data matrix.

Under this framework, we have developed a novel co-clustering algorithm for a special yet
very popular case, non-negative dyadic data. The algorithm iteratively computes the three
decomposition matrices using multiplicative updating rules derived from an objective
criterion. By intertwining the row clusterings and the column clusterings at each iteration,
the algorithm performs an implicitly adaptive dimensionality reduction, which works well for
the high-dimensional, sparse data typical of many data mining applications. The algorithm
has been implemented for two cases: one for asymmetric dyadic data matrices, called NBVD,
and the other for symmetric dyadic data matrices, called symmetric NBVD. We have proven that
the algorithm is guaranteed to converge and have conducted extensive experimental
evaluations to demonstrate the effectiveness and potential of the framework and the
algorithms.


$$Z \approx RBC \qquad (1)$$
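
As a rough illustration of Eq. 1, the sketch below factorizes a small non-negative matrix with Lee-Seung-style multiplicative updates for the squared Frobenius error $\|Z - RBC\|^2$. The update rules, the function name `nbvd_sketch`, the parameters `k`, `l`, `n_iter`, and `eps`, and the toy matrix are all assumptions made for illustration; the exact rules and objective used by NBVD are given in [Long2005a] and may differ.

```python
import numpy as np

def nbvd_sketch(Z, k, l, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative Z (n x m) by R (n x k) @ B (k x l) @ C (l x m)."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    # Strictly positive random start so multiplicative updates can move every entry.
    R = rng.random((n, k)) + eps
    B = rng.random((k, l)) + eps
    C = rng.random((l, m)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update rescales one factor while the other two are held fixed.
        BC = B @ C
        R *= (Z @ BC.T) / (R @ (BC @ BC.T) + eps)
        B *= (R.T @ Z @ C.T) / (R.T @ R @ B @ (C @ C.T) + eps)
        RB = R @ B
        C *= (RB.T @ Z) / ((RB.T @ RB) @ C + eps)
        errors.append(float(np.linalg.norm(Z - R @ B @ C) ** 2))
    return R, B, C, errors

# Hypothetical toy input: a 6 x 8 matrix with a planted 2 x 2 block structure.
Z = np.kron(np.array([[5.0, 1.0], [1.0, 4.0]]), np.ones((3, 4)))
R, B, C, errors = nbvd_sketch(Z, k=2, l=2)
row_clusters = R.argmax(axis=1)  # largest row coefficient per row
col_clusters = C.argmax(axis=0)  # largest column coefficient per column
```

Reading cluster labels off the largest coefficient in each row of R and each column of C is one simple convention; B then plays the role of the compact block summary described above.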

Figures 1 and 2 show two examples of BVD applications, and Figure 3 gives a conceptual
illustration of the BVD operation.

Figure 1 - BVD application example of document clustering

Figure 2 - Another example of BVD as clustering the proximity matrix

Figure 3 - Conceptual illustration of the BVD operation


We have used the 20 newsgroup data to evaluate the BVD framework through the two cases of
the algorithm. For NBVD, we compare its performance with those of Non-negative Matrix
Factorization (NMF) [Lee1999], Information-theoretic Co-Clustering (ICC) [Dhillon2003], and
Iterative Double Clustering (IDC) [El-Yaniv2001]. Table 1 compares the precision values on
three different 20 newsgroup data sets (binary, multi5, and multi10). NBVD achieves the
highest precision on multi5 and multi10 and is within 0.01 of the best method, ICC, on
binary.


Table 1 - Performance comparison (precision) among NBVD, NMF, ICC, and IDC

Data set   NBVD   NMF    ICC    IDC
Binary     0.95   0.91   0.96   0.85
Multi5     0.93   0.88   0.89   0.88
Multi10    0.67   0.60   0.54   0.55

We applied symmetric NBVD to the proximity matrix partition problem and compared it with
Normalized Cut (NC) [Shi2000] and Average Association (AA) [Zha2002]. Figure 4 documents the
performance comparison, which once again demonstrates the superiority of the BVD framework.


Figure 4 - Performance comparison among symmetric NBVD, NC, and AA

3.  Clustering  Ensemble  for  Community  Generation 

After community generation, the question arises of how good the generated communities are.
The question is especially pressing when we apply different community generation algorithms
to the same data collection, or even the same algorithm with different parameters. Given
such different community generation results for the same data collection, which one should
we pick? Without a priori knowledge of the data collection, we cannot decide which one is
the best. A natural solution is to combine them all into a single consensus result.

This approach becomes mandatory when the data collection is distributed over a network and
each site performing community generation can access only part of the global data
collection. Since each site sees only part of the global data, the community generation
result at that site is expected to be imperfect. Therefore, the individual results obtained
at all the sites must be combined to secure the best result for the whole data collection.

Another common scenario, similar to the distributed data collection, is privacy-preserving
data mining, in which each party sees only part of the whole data collection because of the
privacy-preserving requirement, so community generation at that party can be based only on
that part. Consequently, after all the parties obtain their own community generation
results, it is necessary to combine the results across all the parties to obtain the best
result. Figure 5 illustrates the scenario for clustering ensemble.


Figure  5  -  Conceptual  illustration  of  the  community  generation  ensemble 


Based on these motivations, we have developed a clustering ensemble approach called Soft
Correspondence Ensemble Clustering (SCEC) for clustering-based community generation ensemble
[Long2005b]. Figure 6 outlines the SCEC algorithm.


Algorithm 1 SCEC($M^{(1)}, \ldots, M^{(r)}, k$)

1: Initialize $M, S^{(1)}, \ldots, S^{(r)}$.
2: while convergence criterion of $M$ is not satisfied do
3:   for $h = 1$ to $r$ do
4:     while convergence criterion of $S^{(h)}$ is not satisfied do
5:       $S^{(h)} \leftarrow S^{(h)} \odot \dfrac{(M^{(h)})^T M + \beta k \mathbf{1}_{k_h k}}{\ldots}$
6:     end while
7:   end for
8:   $M = \frac{1}{r} \sum_{h=1}^{r} M^{(h)} S^{(h)}$
9: end while

Figure  6  -  SCEC  Algorithm  outlines 
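
The outline in Figure 6 can be sketched in a few lines of NumPy. Here each $M^{(h)}$ is a one-hot membership matrix of clustering $h$ and each $S^{(h)}$ is a non-negative soft correspondence matrix. The inner update assumes the objective $\|M - M^{(h)}S^{(h)}\|^2 + \beta\|S^{(h)}\mathbf{1}_{kk} - \mathbf{1}_{k_h k}\|^2$, i.e., a penalty pushing each row of $S^{(h)}$ to sum to one. This penalty, the initialization of $M$ from the first clustering, the fixed iteration counts in place of convergence tests, and all function and parameter names are assumptions for illustration; [Long2005b] remains the reference for the actual algorithm.

```python
import numpy as np

def one_hot(labels, k):
    """Membership matrix with one row per object and one column per cluster."""
    M = np.zeros((len(labels), k))
    M[np.arange(len(labels)), labels] = 1.0
    return M

def scec_sketch(memberships, k, beta=0.1, n_outer=20, n_inner=50, seed=0):
    """Alternate between fitting each S^(h) and averaging M = (1/r) sum M^(h) S^(h)."""
    rng = np.random.default_rng(seed)
    # Assumption: the first clustering has exactly k clusters and seeds M.
    M = memberships[0].copy()
    S = [rng.random((Mh.shape[1], k)) + 1e-3 for Mh in memberships]
    for _ in range(n_outer):
        for h, Mh in enumerate(memberships):
            for _ in range(n_inner):
                # Multiplicative update under the assumed row-sum penalty.
                num = Mh.T @ M + beta * k
                row_sums = S[h].sum(axis=1, keepdims=True)
                den = Mh.T @ Mh @ S[h] + beta * k * row_sums + 1e-12
                S[h] *= num / den
        M = sum(Mh @ Sh for Mh, Sh in zip(memberships, S)) / len(memberships)
    return M, S

# Hypothetical input: the same partition of 12 objects under three label permutations.
a = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
b = [2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1]
c = [1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0]
memberships = [one_hot(x, 3) for x in (a, b, c)]
M, S = scec_sketch(memberships, k=3)
consensus = M.argmax(axis=1)
```

In this toy case each learned $S^{(h)}$ approaches the permutation that maps clustering $h$ onto the consensus labels, so no explicit label matching step is needed.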

The key technical challenge for community generation ensemble is to find the community
correspondences between the individual community generation results. All the existing
methods in the literature make the strong assumption that the communities generated in
different results satisfy a one-to-one correspondence. This assumption clearly does not hold
in many applications. The major significance and intellectual merit of SCEC is that it
proposes soft correspondence in place of this kind of "hard" correspondence, so that the
one-to-one requirement is removed and the algorithm may be applied to scenarios where that
assumption fails. It can be shown that SCEC can always deliver the best ensemble solution.

Another advantage of SCEC is that, owing to its soft correspondence capability, it can
handle scenarios with missing attribute values in the data collection. SCEC "automatically"
takes care of the missing values and is still able to deliver the best solution. Finally,
SCEC promises to be efficient in comparison with the existing methods in the literature.

We used publicly available UCI data sets to evaluate the SCEC algorithm. Specifically, we
used the IRIS, PENDIG, and ISOLET6 data sets. We compared SCEC with four existing methods
from the literature: the cluster-based similarity partitioning algorithm (CSPA)
[Strehl2002], the meta-clustering algorithm (MCLA) [Strehl2002], quadratic mutual
information (QMI) [Topchy2003], and mixture-model-based ensemble clustering (MMEC)
[Topchy2004], as well as the baseline k-means. For each data set, we tried three different
cases: random initialization (RI), random number (RN), and random subsets (RS). Figure 7
documents the overall performance comparison between SCEC, the existing methods, and the
k-means baseline for the nine scenarios (three data sets with the three cases). From the
figure, it is clear that SCEC outperforms the existing methods and the baseline in most of
the cases.
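
Figure 7 reports its comparison as average NMI. Assuming NMI here denotes normalized mutual information between a clustering result and reference class labels, a minimal sketch of the measure follows. The report does not state which normalization is used; the code assumes the geometric-mean form $I(X;Y)/\sqrt{H(X)H(Y)}$, and the function name `nmi` is illustrative.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings (geometric-mean normalization)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # Map arbitrary label values to 0..k-1 indices.
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    # Contingency table of joint label counts.
    cont = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(cont, (ia, ib), 1)
    p = cont / cont.sum()
    pa = p.sum(axis=1)
    pb = p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha == 0 or hb == 0:
        return 0.0  # a single-cluster labeling carries no information
    return float(mi / np.sqrt(ha * hb))

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # statistically independent
```

The measure is invariant to how clusters are numbered, which is why it suits comparisons among ensemble methods whose cluster labels are arbitrary.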


Average NMI for nine settings

Figure 7 - Performance evaluations between SCEC, the existing methods, and the baseline
k-means under the nine different scenarios

4. Spectral Approach to Bi-party Data Community Generation


Most clustering-based bi-party data community generation approaches in the literature focus
on "flat" data in which each data object is represented as a fixed-length feature vector.
However, many real-world data sets are much richer in structure, involving objects of
multiple types that are related to each other, such as Web pages, search queries, and Web
users in a Web search system, or papers, key words, authors, and conferences in a scientific
publication domain. In such scenarios, using traditional methods to cluster each type of
object independently may not work well, for the following reasons.

First, to make use of relation information under the traditional clustering framework, the
relation information needs to be transformed into features. In general, this transformation
causes information loss and/or very high-dimensional and sparse data. For example, if we
represent the relations between Web pages and Web users as well as search queries as
features for the Web pages, this leads to a huge number of features with sparse values for
each Web page. Second, traditional clustering approaches are unable to handle the
interactions among the hidden structures of different types of objects, since they cluster
data of a single type based on static features. Note that the interactions can pass along
the relations, i.e., there exists influence propagation in multi-type relational data.
Third, in some machine learning applications, users are interested not only in the hidden
structure for each type of object but also in the global structure involving multiple types
of objects. For example, in document clustering, in addition to document clusters and word
clusters, the relationship between document clusters and word clusters is also useful
information. It is difficult to discover such global structures by clustering each type of
object individually.

Therefore, multi-type relational data presents a great challenge for traditional clustering
approaches. In this study [Long2006a], first, we propose a general model, the collective
factorization on related matrices, to discover the hidden structures of multiple types of
objects based on both feature information and relation information. By clustering the
multiple types of objects simultaneously, the model performs adaptive dimensionality
reduction for each type of data. Through the related factorizations, which share factors,
the hidden structures of different types of objects can interact under the model. In
addition to the cluster structures for each type of data, the model also provides
information about the relations between clusters of different types of objects, which
identifies the global community structures in addition to the local clusterings. Figure 8
illustrates such an example in the application of Web mining.

Figure 8 - An example of multi-type relational data mining for identifying the global
community structures in addition to the local clusterings


Second, under this model, we derive a novel algorithm, spectral relational clustering, to
cluster multi-type interrelated data objects simultaneously. By iteratively embedding each
type of data object into a low-dimensional space, the algorithm benefits from the
interactions among the hidden structures of different types of data objects. The algorithm
has the simplicity of spectral clustering approaches but is also applicable to relational
data with various structures. Theoretical analysis and experimental results demonstrate the
promise and effectiveness of the algorithm.

Third, we show that the existing spectral clustering algorithms can be considered special
cases of the proposed model and algorithm. This provides a unified view for understanding
the connections among these algorithms.

Specifically, we propose the Collective Factorization on Related Matrices (CFRM) model. For
each relation in the multi-type relational data collection, we represent the relation as a
related matrix R. According to the BVD framework we have developed [Long2005a], this matrix
can be decomposed into three components. Similarly, a feature matrix F can be considered a
special case of BVD and decomposed into two components, since the third one is an identity
matrix. Consequently, we have Eq. 2 below for the general multi-type relational data CFRM
model:


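$$\min \sum_{1 \le i < j \le m} w_a^{(ij)} \left\| R^{(ij)} - C^{(i)} A^{(ij)} (C^{(j)})^T \right\|^2 + \sum_{1 \le i \le m} w_b^{(i)} \left\| F^{(i)} - C^{(i)} B^{(i)} \right\|^2 \qquad (2)$$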
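
To make the terms of Eq. 2 concrete, the following sketch evaluates the CFRM objective for given factor matrices. It assumes the norm is the Frobenius norm and stores the matrices in dictionaries keyed by a type index `i` or a type pair `(i, j)`; the function name `cfrm_objective`, this data layout, and the toy matrices are illustrative choices, not part of the report.

```python
import numpy as np

def cfrm_objective(R, F, C, A, B, w_a, w_b):
    """Weighted sum of squared residuals of the relation and feature factorizations."""
    total = 0.0
    # Relation terms: R^(ij) is approximated by C^(i) A^(ij) (C^(j))^T.
    for (i, j), Rij in R.items():
        resid = Rij - C[i] @ A[(i, j)] @ C[j].T
        total += w_a[(i, j)] * np.sum(resid ** 2)
    # Feature terms: F^(i) is approximated by C^(i) B^(i).
    for i, Fi in F.items():
        resid = Fi - C[i] @ B[i]
        total += w_b[i] * np.sum(resid ** 2)
    return float(total)

# Hypothetical two-type example built so that the factorization is exact.
C = {1: np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]),
     2: np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])}
A = {(1, 2): np.array([[3.0, 0.0], [1.0, 2.0]])}
B = {1: np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 4.0]])}
R = {(1, 2): C[1] @ A[(1, 2)] @ C[2].T}
F = {1: C[1] @ B[1]}
w_a = {(1, 2): 2.0}
w_b = {1: 0.5}

exact = cfrm_objective(R, F, C, A, B, w_a, w_b)

# Perturbing one relation entry by 1 adds w_a * 1^2 to the objective.
R_noisy = {(1, 2): R[(1, 2)].copy()}
R_noisy[(1, 2)][0, 0] += 1.0
perturbed = cfrm_objective(R_noisy, F, C, A, B, w_a, w_b)
```

Because every term shares the cluster indicator matrices $C^{(i)}$, a change to one $C^{(i)}$ affects all relation and feature terms involving type $i$, which is how the hidden structures of different types interact.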


Based on this CFRM model, we have developed an algorithm called Spectral Relational
Clustering (SRC) for clustering-based community generation in the general multi-type
relational data scenario. The technical significance and intellectual merit of SRC are that
it is as simple as traditional spectral clustering yet can be applied to relational data
with various structures; in addition, it has the advantage of low-dimensional embedding
during the clustering, and it is efficient in comparison with the existing methods in the
literature.

To evaluate the performance of SRC extensively, we use the 20 newsgroup data. We compare
SRC with existing methods in the literature: Normalized Cut (NC) [Shi2000], Bipartite
Spectral Graph Partitioning (BSGP) [Dhillon2001], Mutual Reinforcement K-means (MRK)
[Long2006a], and Consistent Bipartite Graph Co-partitioning (CBGC) [Gao2005]. Figure 9
documents the evaluations of the bi-partite graph scenario for the 20 newsgroup data set,
and Figure 10 documents the evaluations of the tri-partite graph scenario for the same data
set. In some cases where the data sets are large, CBGC ran out of memory, so its results are
not available. From these figures, it is clear that SRC outperforms all the compared methods
in all the cases.


NMI Comparisons on Bi-type Relational Data

Figure 9 - Evaluations of SRC against NC and BSGP for the bi-partite graph scenario using
the 20 newsgroup data set


NMI Comparisons on Tri-type Relational Data

Figure 10 - Evaluations of SRC against MRK and CBGC for the tri-partite graph scenario using
the 20 newsgroup data set

5.  Relation  Summary  Network  as  a  General  Framework 

Based  on  the  above  study  [Long2006a],  we  went  further  to  propose  a  more  general 
framework.  We  first  generalize  the  multi-type  relational  data  as  a  k-partite  graph. 

An intuitive attempt to mine the hidden structures from k-partite graphs is to apply
existing graph partitioning approaches. This idea may work in some special and simple
situations, but in general it is infeasible. First, graph partitioning theory focuses on
finding the best cuts of a graph under a certain criterion, and it is very difficult to cut
different types of relations (links) simultaneously to identify different hidden structures
for different types of nodes. Second, by partitioning the whole k-partite graph into $m$
subgraphs, one actually assumes that all the different types of nodes have the same number
of clusters $m$, which in general is not true. Third, by simply partitioning the whole graph
into disjoint subgraphs, the resulting hidden structures are rough; for example, the
clusters of different types of nodes are restricted to one-to-one associations.

Therefore, mining hidden structures from k-partite graphs presents a great challenge to
traditional unsupervised learning approaches. In this study [Long2006b], first, we propose a
general model, the relation summary network (RSN), to find the hidden structures (the local
cluster structures and the global community structures) of a k-partite graph. The basic idea
is to approximate the original k-partite graph by a new k-partite graph with hidden nodes,
which "summarize" the link information in the original graph and make the hidden structures
explicit. The model provides a principled framework for unsupervised learning on k-partite
graphs of various structures. Second, under this model, based on the matrix representation
of a k-partite graph, we reformulate the graph approximation as a matrix approximation
optimization problem and derive an iterative algorithm that finds the hidden structures of a
k-partite graph under a broad range of distortion measures. By iteratively updating the
cluster structures for each type of node, the algorithm takes advantage of the interactions
among the cluster structures of different types of nodes and performs implicit adaptive
feature reduction for each type of node. Experiments on both synthetic and real data sets
demonstrate the promise and effectiveness of the proposed model and algorithm. Third, we
establish the connections between existing clustering approaches and the proposed model to
provide a unified view of the clustering approaches.


Figure 11 - An example of the RSN model: (a) original tri-partite graph; (b) its RSN
approximation

Figure 11 illustrates an example of the RSN model, where (a) is an original tri-partite
graph and (b) is its corresponding RSN approximation. Based on this model, we use a matrix
representation for all the relations. According to the BVD framework [Long2005a], each
relation can then be decomposed into three components, and the key to the problem is to find
the graph with summary nodes that most closely approximates the original graph.

This approach calls for a distance function to measure the "closeness" between the two
graphs. We have developed a general algorithm under the RSN model that is applicable to a
wide spectrum of distance functions corresponding to different probability distributions.
This algorithm is called Relation Summary Network with Bregman Divergence (RSN-BD). The
algorithm is listed in Figure 12. Refer to [Long2006b] for the details of the algorithm.

Algorithm 1 Relation Summary Network with Bregman Divergences

Input: A k-partite graph $G = (V_1, \ldots, V_m, E)$, a Bregman divergence function
$D_\phi$, and $m$ positive integers $k_1, \ldots, k_m$.
Output: An RSN $G^s = (V_1, \ldots, V_m, S_1, \ldots, S_m, E^s)$.
Method:
1: Initialize $G^s$.
2: repeat
3:   for $i = 1$ to $m$ do
4:     Update the edges between $V_i$ and $S_i$ according to Eq.(11).
5:   end for
6:   for each pair of $S_i \sim S_j$ where $1 \le i < j \le m$ do
7:     Update the edges between $S_i$ and $S_j$ according to Eq.(13).
8:   end for
9: until convergence


Figure  12  -  RSN-BD  algorithm 

The advantages of RSN-BD include: (1) implicit adaptive dimensionality reduction through
hidden nodes; (2) applicability to k-partite graphs of various structures; (3) applicability
to graphs with different probabilistic distributions on their edges; and (4) efficiency.

To evaluate the RSN model as well as the RSN-BD algorithm, we have implemented RSN-BD with
four different Bregman divergence functions: Euclidean distance (RSN-ED), logistic loss
(RSN-LL), generalized I-divergence (RSN-GI), and Itakura-Saito distance (RSN-IS). The first
corresponds to the normal distribution, the second to the Bernoulli distribution, the third
to the Poisson distribution, and the fourth to the exponential distribution. We compare RSN
with k-means in the corresponding four cases (ED, LL, GI, and IS), BSGP [Dhillon2001], and
CBGC [Gao2005].
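
For reference, the four divergences named above can be written element-wise as follows. These are the standard textbook forms of each Bregman divergence, summed over matrix entries, and may differ from the implementation in [Long2006b] by constant factors. The function names are illustrative, and the sketch assumes strictly positive inputs (and values strictly between 0 and 1 for the logistic loss) so that the logarithms are defined.

```python
import numpy as np

def euclidean(x, y):
    """Squared Euclidean distance (associated with the normal distribution)."""
    return float(np.sum((x - y) ** 2))

def logistic_loss(x, y):
    """Logistic loss for values in (0, 1) (associated with the Bernoulli distribution)."""
    return float(np.sum(x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))))

def generalized_i_divergence(x, y):
    """Generalized I-divergence for positive values (associated with the Poisson distribution)."""
    return float(np.sum(x * np.log(x / y) - x + y))

def itakura_saito(x, y):
    """Itakura-Saito distance for positive values (associated with the exponential distribution)."""
    return float(np.sum(x / y - np.log(x / y) - 1))

# Hypothetical edge weights and their approximation.
x = np.array([0.2, 0.5, 0.9])
y = np.array([0.3, 0.4, 0.8])
divergences = {f.__name__: f(x, y) for f in
               (euclidean, logistic_loss, generalized_i_divergence, itakura_saito)}
```

Each function is zero when its arguments coincide and positive otherwise, but apart from the Euclidean case none is symmetric, so the order of the original and approximating values matters.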

We used simulated graph data as well as real data for the evaluations. The real data are
the 20 newsgroup data set, from which we generated bi-partite and tri-partite graphs. For
the simulated data, we generated graphs under Bernoulli, Poisson, and exponential
distributions. Figure 13 documents the performance of all the algorithms on the simulated
bi-partite graphs under the Bernoulli, Poisson, and exponential distributions. Figure 14
documents the performance comparison among all the algorithms for three different bi-partite
graphs from the real 20 newsgroup data set. Figure 15 documents the performance comparison
among all the algorithms for the simulated exponential tri-partite graph and two other
tri-partite graphs from the real 20 newsgroup data set. From these figures, it is clear that
the RSN model and algorithm are superior to the compared methods.

NMI scores on synthetic bipartite graphs

Figure 13 - Bi-partite graphs under Bernoulli, Poisson, and exponential distributions for
the performance comparison among all the algorithms

Figure 14 - Bi-partite graphs of the real 20 newsgroup data set for the performance
comparison among all the algorithms

Figure 15 - Tri-partite graphs of simulated exponential distribution and two scenarios of
the real 20 newsgroup data set for the performance comparison among all the algorithms

6.  Conclusions 

In this final report, we summarize the technical achievements made in the project, as well
as the societal and broader impacts obtained through the publicity the project has
generated. The project was extremely successful, with substantial achievements and
publicity. I hope that the techniques and technologies generated from this research will be
useful not only to the government but also to the public. I also hope that this work will
continue to be supported in the future.